Abstract

Despite increasing pressure for children to learn to write at younger ages, there are 
many unanswered questions about composition skills in early elementary school. 
The goal of this research was to examine the dimensionality of composition skills in 
kindergarten children, thereby adding to current knowledge about the measurement 
of young children’s writing and its component skills. The writing of 282 kindergarten
children was assessed using three different scoring methods. Confirmatory factor
analyses were used to investigate the dimensionality of the various scoring methods.
Results indicated that a qualitative scoring system and a productivity scoring
system capture distinct dimensions of kindergartners’ compositions. A scoring sys- 
tem for curriculum-based measurement could not attain acceptable fit, which may 
suggest that CBM is ill-suited for capturing the important components of composi- 
tion for kindergartners. This study indicated that the measurement and components 
of composition in kindergarten may be qualitatively different from the compositions 
of older children. 


Keywords Component skills · Confirmatory factor analysis · Dimensionality of
writing · Early writing · Kindergarten · Writing assessment


Introduction 


The measurement of composition skills has grown increasingly important for 
educators and researchers due to pressure for children to write at young ages and 
an increased emphasis on data-driven decision making in the classroom. Despite 
the important role of writing in academic learning, there are still many open 
questions and a lack of consensus about how best to measure young children’s 
composition. Composition refers to children’s ability to generate ideas for what to
write and compose phrases, sentences, or texts in their writing using conventional 
or invented spelling. One important issue revolves around how to best capture 
children’s written output, or in other words, what components of written com- 
position are important. Some studies have reported or assumed that young chil- 
dren’s composition is accurately represented as a single, holistic component or 
score (e.g., Abbott & Berninger, 1993; Gansle, VanDerHeyden, Noell, Resetar, & 
Williams, 2006), whereas others have found that the writing of elementary school 
students contains many components, including macro-organization, productivity, 
complexity, and accuracy (Hall-Mills & Apel, 2015; Kim, Al Otaiba, Folsom, 
Greulich, & Puranik, 2014; Kim, Al Otaiba, Wanzek, & Gatlin, 2015; Puranik,
Lombardino, & Altmann, 2008; Wagner et al., 2011). 

A second issue is what type of measurement best captures these important 
components of writing. Researchers have used many different methods to measure 
elementary school children’s compositions, including qualitative scoring systems, 
quantitative scoring, and curriculum-based measures (CBM; Abbott & Berninger, 
1993; Dockrell, Ricketts, Charman, & Lindsay, 2014; McMaster & Espin, 2007; 
Puranik & Al Otaiba, 2012; Wagner et al., 2011). These various scoring meth- 
ods may be capturing different components of composition ability. Consequently, 
there is little consensus regarding what composition is and how it is best assessed 
during early childhood. A greater understanding of the measurement of young 
children’s composition is necessary so that researchers can continue to appropri- 
ately assess writing, examine predictors of composition ability, and design inter- 
ventions for struggling composers. A greater understanding of measurement will 
also enable more accurate identification of students who are struggling and may 
allow practitioners to pinpoint the specific skills that must be supported. 

The primary aim of this study was to examine the dimensionality of writing in 
kindergarten children. To date, there have been no studies examining the dimen- 
sionality of composition in kindergarten children. Kindergarten is a year when 
children are just learning to write, including learning to write letters, spell words, 
and write sentences. Because they are just beginning to learn to write, their compositions
may be qualitatively and quantitatively different from the compositions
of older children. This study attempted to add to the existing body of literature by 
replicating previous findings about dimensionality in older children, examining 
the dimensionality of a less-studied qualitative scoring system, examining alter- 
nate possibilities of the dimensionality of CBM, and extending the research to 
kindergarten compositions. 


Components of young children’s compositions 


Previously identified dimensions of composition include macro-organization, 
accuracy, productivity, and complexity. The findings of recent studies regarding 
the dimensions present in the writing of young children and the indicators of each 
dimension are discussed below. 
Macro-organization


The macro-organization of writing is generally considered to be the most important 
outcome or measurement of written composition (e.g., Kim, Al Otaiba, Wanzek, 
et al., 2015), perhaps because the writing of older students is judged primarily on its 
organization and content (e.g., ACT, 2018). The macro-organization component of 
writing is typically defined as the content of the ideas and the overall organization of 
the composition (Hall-Mills & Apel, 2015; Kim et al., 2014; Wagner et al., 2011), 
though some researchers include components such as word choice and sentence flu- 
ency (e.g. Kim et al., 2014). 


Accuracy 


The accuracy of the use of writing conventions is sometimes considered as a distinct 
component of writing. This component typically includes the accuracy of spelling, 
grammar, and mechanics such as punctuation and capitalization (Kim et al., 2014). 
This is also a component of composition that is considered in many standardized 
tests. For example, the Georgia Standards of Excellence, which apply where part of
this research was conducted, require kindergartners to “demonstrate command of the
conventions of standard English grammar and usage when writing or speaking” and
“demonstrate command of the conventions of standard English capitalization, punctuation,
and spelling when writing” (Georgia Department of Education, 2015, p. 5).


Productivity 


Because of the difficulty of reliably rating the quality of young children’s writing, a 
common approach is to use writing productivity as an outcome for kindergartners 
and first graders (Kent, Wanzek, Petscher, Al Otaiba, & Kim, 2014; Kim et al., 2011; 
Puranik & Al Otaiba, 2012; Puranik, Al Otaiba, Sidler, & Greulich, 2014). Writing 
productivity is measured by counting the number of words, different words, ideas, 
clauses, or sentences (Berninger et al., 1992; Graham, Berninger, Abbott, Abbott, & 
Whitaker, 1997; Kent et al., 2014; Kim, Park, & Park, 2013; Puranik et al., 2008). 
For young composers, writing quality and productivity are related both conceptu- 
ally (Kim, Al Otaiba, Wanzek, et al., 2015) and empirically (Abbott & Berninger, 
1993; Kim et al., 2014; Nelson & Van Meter, 2007), but are nevertheless distinct 
constructs (Kim et al., 2014; Kim, Al Otaiba, Wanzek, et al., 2015; Wagner et al., 
2011). Their conceptual link stems from the fact that children who write greater 
quantities of text have more opportunities to convey complex, meaningful ideas. 


Syntactic complexity 


Syntactic complexity is another distinct component of children’s writing (Kim 
et al., 2014; Puranik et al., 2008; Wagner et al., 2011). There are multiple meth- 
ods for scoring syntactic complexity, but the most common methods rely on assess- 
ing the number or ratio of main and subordinate clauses in the composition. This 
may be a valuable component of writing because writers are expected to produce 
compositions that include a variety of sentence structures. Previous studies have
identified syntactic complexity as a separate dimension of children’s writing, albeit
with slightly older children from first grade and above. The majority of the essays in
this study contain either no complete clauses or only one. Since little variation is 
expected in these scores, the present study did not measure syntactic complexity. 


Methods for measuring young children’s compositions 


Ideas on how best to measure the important components of children’s writing vary. 
Historically, researchers often used a single score to capture the quality of written 
composition, though researchers vary in their definition of quality (e.g. Abbott & 
Berninger, 1993; Berninger et al., 1992; Graham et al., 1997; Olinghouse, 2008). 
More recently, researchers have turned to other analytical scoring methods such as 
quantitative scoring and CBM. These scoring systems, along with the components 
of composition they are hypothesized to capture, are discussed below. 


Qualitative scoring systems 


Several scoring systems have been developed to measure the quality of writing. One 
such scoring system is the 6+1 Traits rubric (Education Northwest, 2017). The 6+1
Traits Rubrics system is widely used, freely available, and frequently researched
(e.g., Gansle et al., 2006; Kim et al., 2014); however, there are limited data on its
technical adequacy and scoring reliability. The rubric contains seven different categories
that each have criteria for scoring multiple aspects within that category. For exam- 
ple, the organization category contains criteria for scoring the quality of the com- 
position’s beginning, middle, and end; transitions; sequencing; and title. Although 
there are seven different categories, research indicates that the rubric captures two 
distinct dimensions of writing for first graders: scores for ideas, organization, word 
choice, and sentence fluency capture the macro-organization of the writing, whereas 
the spelling, mechanics, and handwriting categories capture the technical accuracy 
of the writing (Kim et al., 2014). 

Coker and Ritchey (2010) have proposed a similar but pared-down scoring sys- 
tem for scoring the quality of short writing samples from children as young as kin- 
dergarten. The categories of the scoring system were selected to reflect important 
features of writing that are reasonable expectations for young writers. Although it 
has not been studied as extensively as other methods of scoring, it shows promise 
as a quick and reliable measure of writing with acceptable criterion-related validity
compared to more established measures (Coker & Ritchey, 2010). Like the 6+1
Traits Rubric, this scoring system is organized into categories, in this case five.


Response type The response type category measures the completeness and com- 
plexity of composition. In kindergarten, children are typically graduating from 
writing single letters at a time to writing entire sentences, so the length and com- 
plexity of a short composition is a developmentally appropriate and sensitive 
measure of writing ability (Berninger, Fuller, & Whitaker, 1996; Coker & Ritchey, 
2010). The inclusion of response type as a category is in line with the 6+1 Traits
Rubrics because the sentence fluency category awards points for using a greater 
variety of sentences and more complex sentence structures. 


Relationship to prompt The relationship to prompt category measures whether 
the composition is related to the topic and how much the child elaborates on that 
topic. This is a key element for the macro-organization of the text, and variations 
of this category are present in nearly every qualitative writing scoring system. For 
example, in the 6+1 Traits Rubrics, children receive points in the ideas category
for including a clear main idea with supporting details. Under the scoring system 
used by Hall-Mills and her colleagues, the organization category awarded points 
when compositions included a clear beginning and supporting details (Hall-Mills, 
2010; Hall-Mills & Apel, 2015). The scoring system used by Wagner et al. (2011) 
conceptualized this category slightly differently by awarding points based on the 
inclusion of a topic sentence and the number of key elements (main idea, body, and 
conclusion). Wagner and colleagues’ scoring system was appropriate for their age 
group (1st and 4th graders) and their prompt. 


Grammatical structure The grammatical structure category measures how many 
grammatical mistakes a child makes and how those mistakes impact the meaning 
of the sentence. Grammatical accuracy is a consideration in most measurement 
systems. Most researchers group this measure with the accuracy of writing (Edu- 
cation Northwest, 2017; Puranik et al., 2008) or include it as a unique dimension 
(Hall-Mills & Apel, 2015). 


Spelling Measurements of spelling accuracy are also included in nearly every 
writing scoring system. Kim et al. (2014) found that this measure fit on a factor 
that represented the accuracy of writing conventions. 


Mechanics Mechanical accuracy is also frequently measured in writing. Rating 
systems designed for more mature writers sometimes include more stringent criteria,
such as correct capitalization of proper nouns and titles (e.g., Education Northwest,
2017), whereas rating systems for younger, less mature writers might include
only capitalization of the first letter in the sentence and correct ending punctuation
(Puranik et al., 2008; Wagner et al., 2011). 


Productivity scoring system


There is more agreement regarding the variables that should be included in productivity
scoring systems. The most frequently included variable is words written
(WW); others include the number of ideas, the number of different words, and the
number of minimal terminable units.


Words written (WW) Although different systems of productivity measurement
vary in which units they count, a measure of WW is included in nearly all of them 
(Hall-Mills & Apel, 2015; Kim et al., 2014; Kim, Al Otaiba, Wanzek, et al., 2015; 
Puranik et al., 2008; Tindal & Parker, 1989; Wagner et al., 2011). Under this sys- 
tem, words that are repeated throughout the composition are counted each time 
they appear. 
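
As a rough illustration of the WW counting rule described above, the following Python sketch counts every word token, including repeats. The tokenization rule (runs of letters and apostrophes), the function names, and the sample sentence are assumptions made for illustration only; in this study the counts were produced by trained human raters, who could also interpret invented spellings. The count of different words is included only for contrast with WW.

```python
import re

# Assumed tokenization for illustration: a "word" is a run of letters/apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def words_written(text: str) -> int:
    """Count every word token; repeated words are counted each time they appear."""
    return len(WORD_PATTERN.findall(text))


def different_words(text: str) -> int:
    """Count distinct words, ignoring case, for comparison with WW."""
    return len({token.lower() for token in WORD_PATTERN.findall(text)})


# Hypothetical sample, not taken from the study's data.
sample = "I like my dog my dog is big"
ww = words_written(sample)
ndw = different_words(sample)
print(ww, ndw)
```

Because "my" and "dog" each appear twice in the sample, the WW count is higher than the count of different words.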


Number of ideas The number of ideas in a composition is the number of complete 
propositions, or subject-predicate pairs (e.g. Kim et al., 2011, 2014; Puranik et al., 
2008). This is an important supplement to the WW measure because it awards credit 
to writers who express complex ideas concisely. These writers might be recognized 
as less productive if WW is the only measure of productivity. 


Curriculum-based measures 


Demands for accountability and quantifiable learning in education have led edu- 
cators and researchers to develop measures of academic skills that are easy and 
quick to administer, that can be scored by teachers, and that can track students’ 
growth; these measures are commonly called curriculum-based measures (CBM; 
Hosp, Hosp, & Howell, 2007). CBM for writing involves scoring a short composition
according to the correct or incorrect word sequences (CWS, IWS).¹ A word
sequence is a pair of consecutive words, or a word and an adjacent punctuation
mark. A correct word sequence (CWS) is a pair that is both contextually
and grammatically correct (Hosp et al., 2007). Some have argued that CBM scores 
capture both production-dependent and production-independent aspects (Tindal & 
Parker, 1989). 


Correct word sequences (CWS) Although the number of CWS is related to the accuracy
of the writing conventions (for example, the number of words a child spells
correctly partly determines the CWS score), Tindal and Parker (1989) have demonstrated
that it is a production-dependent measure. In this sense, CWS may be considered
a measure of writing productivity. Alternatively, it may capture the writing
fluency component hypothesized by Kim and colleagues (Kim, Al Otaiba, Wanzek, 
et al., 2015; Kim, Gatlin, Al Otaiba, & Wanzek, 2018). 


Percent of correct word sequences (%CWS) Many researchers choose to use scores 
derived from CWS rather than the raw scores themselves. For example, researchers 
have used the number of correct minus incorrect word sequences or the percent of 
correct word sequences (%CWS) out of total word sequences (McMaster & Espin, 
2007). Tindal and Parker (1989) considered %CWS to be a production-independent 
measure, conceptually distinct from CWS and other productivity measures. They
also found that %CWS was an indicator of the writing quality score, which they
defined as a holistic score that captured “communicative effectiveness” (p. 175), and
that comprised both macro-organization and technical accuracy. Because %CWS is
determined primarily by technical accuracy (such as the correct spelling of two consecutive
words, or the correct usage of punctuation), it may be an indicator of technical
accuracy. Conversely, Kim and colleagues included %CWS in their dissociable
CBM or writing fluency measure (Kim, Al Otaiba, Wanzek, et al., 2015; Kim et al.,
2018), separate from the macro-organization and productivity measures. Therefore,
it is unclear whether %CWS is an additional measure of the technical accuracy of
writing, or whether %CWS (together with CWS) captures a distinct component of
writing that is dissociable from components such as accuracy and productivity. This
study tested both possibilities.

¹ When CBM is administered as the only measure of composition, it sometimes includes a count of
the WW and a count of the words spelled correctly or incorrectly. However, these measures are already
accounted for in the quality and productivity scoring systems used in this study.
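
The arithmetic relationship among the CBM scores discussed in this section can be sketched as follows. This is a minimal illustration, not the scoring procedure used in the study: it assumes a human rater has already judged each word sequence as correct or incorrect, and the function name, the data structure, and the convention of returning 0.0 for an empty composition are all hypothetical.

```python
from typing import List, NamedTuple


class CbmScores(NamedTuple):
    cws: int            # correct word sequences
    iws: int            # incorrect word sequences
    pct_cws: float      # CWS as a percentage of all word sequences
    cws_minus_iws: int  # correct minus incorrect word sequences


def score_word_sequences(judgments: List[bool]) -> CbmScores:
    """Summarize rater judgments, one boolean per adjacent pair (True = correct)."""
    cws = sum(judgments)
    iws = len(judgments) - cws
    total = cws + iws
    # Assumption: a composition with no word sequences gets 0.0 percent.
    pct_cws = 100.0 * cws / total if total else 0.0
    return CbmScores(cws, iws, pct_cws, cws - iws)


# Hypothetical judgments for a composition with five word sequences.
scores = score_word_sequences([True, True, False, True, False])
print(scores)
```

The sketch makes the production-dependent versus production-independent distinction concrete: CWS grows as a child writes more correct sequences, whereas %CWS can stay the same regardless of how much text is produced.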


The present study 


The present study used essays written by kindergartners near the end of the school 
year to investigate how many dimensions emerge when multiple methods of
scoring are used. The first research question examined how many
dimensions are present in kindergarten compositions when a qualitative scoring 
system is used. For this study, we used a modified version of Coker and Ritchey’s 
(2010) quality sentence scoring rubric to rate compositions, as it has been shown 
to be developmentally appropriate for kindergarten children. As mentioned previ- 
ously, this qualitative scoring rubric awards points in five categories: response type, 
relationship to prompt, grammatical structure, spelling, and mechanics. However,
research is needed to determine its dimensionality. Like the more complex 6+1
Traits Rubrics, its five categories may be separable into two distinct dimensions.
Response type, relationship to prompt, and grammatical structure may capture the 
macro-organization of the composition, whereas spelling and mechanics may both 
measure aspects of the accuracy of writing conventions. The present study examined 
whether this qualitative scoring system captures two distinct dimensions of writ- 
ing (macro-organization and accuracy) or a single dimension (which may represent 
the overall quality of the writing). It was hypothesized that the qualitative scores 
would capture two distinct dimensions of writing in line with the findings of Kim 
et al. (2014) with first graders. An alternate possibility was that the scoring system 
would be best represented as a single factor that may capture the overall quality of 
the writing. 

In ratings for older students, the macro-organization factor generally reflects the
inclusion of a topic sentence or key elements (e.g., story elements like plot, character,
and setting). However, this may be too stringent an expectation for kindergartners,
who often write only a sentence or less. Therefore, in the hypothesized
two-dimensional model for kindergartners, response type and relationship
to prompt were expected to be related to macro-organization because these two
categories measure a concept similar to what is measured in scoring systems
for more mature writers. The third category expected to be related to
macro-organization was grammatical structure. This is in contrast to the more
conventional approach of including it with the measures of accuracy. Coker and 
Ritchey (2010) argue that severe mistakes or having many mistakes can compro- 
mise the meaning of the composition or make it impossible to decipher (see also 
Olinghouse, 2008). This is particularly true for young writers, who tend to have 
a high percentage of grammatical mistakes, and who tend to include much less 
context that can clear up confusion. This inherent link with the meaning of the 
composition, which in some ways is unique to the age group in this study, may 
make it a good fit for the macro-organization factor. Finally, spelling and mechanics
were expected to measure the accuracy of writing conventions, as they do in
most other rating systems (e.g., Kim et al., 2014; Wagner et al., 2011).

The second research question examined how many dimensions are present in 
kindergarten compositions when a productivity scoring system is used in addi- 
tion to the qualitative scoring system. It was hypothesized that the productivity 
scoring system would capture a dimension of writing that was separate from the 
dimension captured by the qualitative scoring system. This would be consistent 
with previous studies of children’s writing (e.g., Kim et al., 2014; Puranik et al.,
2008; Wagner et al., 2011). An alternate possibility was that the productivity
scoring system would capture the same dimension as the qualitative scoring sys- 
tem. In line with a great deal of previous research (e.g. Kim et al., 2014; Puranik 
et al., 2008; Wagner et al., 2011), the WW and Ideas measures were hypothesized 
to load onto a productivity factor that would be distinct from the factors of the 
qualitative scoring system. 

The third research question examined how many dimensions are present in kin- 
dergarten compositions when CBM is used for scoring. Although CBM has many 
useful properties, including the fact that it is quick and reliable to score and can 
capture growth across a school year, it is still unclear exactly which aspects of writ- 
ing CBM captures. Examining the dimensionality of two of its measures may help 
to illuminate exactly what educators and researchers measure when they use CBM. 
This is an important consideration, given its prevalence in research and in recom- 
mendations for educators (e.g. Deno, 2003; Hosp et al., 2007; Kim, Al Otaiba, 
Wanzek, et al., 2015; McMaster et al., 2011; McMaster & Espin, 2007; Tindal & 
Parker, 1989). 

One possibility was that the two CBM scores would capture separate dimensions
of writing. These dimensions may represent productivity and accuracy, in line with 
Tindal and Parker’s (1989) findings of production-dependent and production-inde- 
pendent dimensions. Another possibility was that together, the CBM scores would 
capture a single dimension of writing (Kim, Al Otaiba, Wanzek, et al., 2015; Kim 
et al., 2018). Because of conflicting findings from previous studies, there was no a 
priori hypothesis about which of the two models would fit better. 

The fourth research question examined how many dimensions are present in 
kindergarten compositions when a qualitative scoring system, a productivity scor- 
ing system, and CBM are used for scoring. One possibility was that %CWS would 
capture the technical accuracy of the writing, whereas CWS would capture the pro- 
ductivity of the writing, resulting in a two-factor model. An alternate possibility 
was that the two CBM scores together would capture a dimension of writing that 
is distinct from the other dimensions in this study, resulting in a three-factor model.
There was no a priori hypothesis about which of the two models would fit better. 


Methods


Participants


The participants in this study were 281 kindergarten students recruited from pub- 
lic schools serving urban and suburban neighborhoods in the South and Midwest 
United States. 

These students came from 49 different classrooms, with an average of six
participating students per classroom. There was one additional child who participated
in data collection but refused one of the essays. That child’s scores were not used for 
any of the analyses in this paper. 

Ninety-seven percent (273) of the students’ parents returned a questionnaire containing
demographic information. The average child age was about 6.1 years (range
5.6–7.0 years). Other demographic information is presented in Table 1.


Measures and procedures 


Human subjects’ approval was obtained from the Institutional Review Board prior 
to conducting this research. Kindergarten children wrote two essays on two sepa- 
rate days near the end of the school year (April or May). Writing took place in a 
convenient location at the child’s school, usually in a group of about six children. 
In the first prompt, the examiner instructed children to write about a special event 
(essay 1). The examiner introduced the writing topic using a script, saying “Today, 
you are going to draw and write about a special event in your life.” The script the 
examiner followed gave examples of special events (a special birthday or a special 
vacation), elicited an idea from each child, and asked for an additional detail from 
each child (such as “Who was there?”). Then the examiner also instructed children
to try to keep writing for the entire time, to sound out words as best they could, and 
to cross out mistakes instead of erasing them. A different prompt was used for the 
second essay, although the structure of the instructions remained the same. In the 
second prompt, the examiner instructed children to write about something they were 
an expert on or knew a lot about (essay 2). The examples given were about topics 
that children might know a lot about such as lions, cars, or dinosaurs. An example of 
an additional detail the examiner asked for was, “What does it/they look like?”

Children spent 5 min independently drawing a picture and writing their essay. 
During the 5 min, they were not given any assistance with writing (including spell- 
ing). After 5 min, examiners asked children to read what they had written so that 
the examiner could write it in the margins of the paper. This aided scoring in cases 
where children had poor handwriting or spelling. 
Table 1 Demographic characteristics of sample

Variable                      Responses                        N     Percent
Gender                        Male                             128   46.9
                              Female                           144   52.7
                              No response                      1     0.4
Ethnicity                     Black/African American           57    20.9
                              White/Caucasian                  188   68.9
                              Hispanic or Latino               4     1.5
                              Asian (Indian)                   4     1.5
                              Biracial or Multiracial          12    4.4
                              Other                            2     0.7
                              No response                      3     1.1
Child home language           English only                     267   97.8
                              English and other language       4     1.5
                              Other language only
                              No response                      1
Highest education of mother   Less than high school diploma    14
                              High school diploma              40
                              Post high-school training        48
                              Two-year degree                  12
                              Four-year degree                 84
                              Graduate degree                  74
                              No response                      1
Annual family income          $20,000 or less                  33
                              $20,001-$40,000                  41
                              $40,001-$60,000                  20
                              $60,001-$85,000                  38
                              $85,001 or more                  137
                              No response                      4

Essay scoring 


Each essay was scored for quality (qualitative indicators, including response type, 
relationship to prompt, grammatical structure, spelling, and mechanics), produc- 
tivity (WW and ideas) and CBM (CWS and %CWS). Coker and Ritchey’s (2010) 
original scoring system was slightly modified to fit the different task requirements 
and prompts used in this study. The first modification was related to the grammati- 
cal structure category and was necessary due to the length of the writing samples 
in this study. Coker and Ritchey asked participants to write two sentences about a 
prompt, whereas in this study children wrote longer compositions. Coker and Ritch- 
ey’s grammatical structure category awards two points to sentences that contain a 
single grammatical error and one point to sentences with more than one error. Since 
the participants in this study sometimes wrote longer compositions, the scoring 
system in this study allowed two points for compositions even if they contained multiple
grammatical errors, provided that the errors did not comprise more than 50%
of the writing sample and did not have a major effect on the meaning. The second 
major modification was of the relationship to prompt category. Coker and Ritchey’s 
scoring system was designed to score sentences about any prompt. However, in this 
study, the scoring system needed to apply to only two topics, and needed to capture
differences in slightly longer compositions. Therefore, the scoring system used in 
this study maintains the spirit of Coker and Ritchey’s scoring system by awarding 
points for details that are appropriately related to the prompt, while being more spe- 
cific to the respective prompts to maximize scoring reliability. 


Inter-rater reliability Four graduate research assistants (GRAs) were trained exten- 
sively by the first author on kindergarten essays from a previous study until they 
reached a reliability of 80%. For scoring the essays in this study, GRAs worked in 
pairs to score the essays for quality (response type, relationship to prompt, gram- 
matical structure, spelling, and mechanics), productivity (WW and ideas), and CBM 
(CWS and %CWS). Each assessment was individually scored by two GRAs, and 
reliability was calculated for each measure. Because the qualitative scoring system 
had four possible scores for each category, it was treated as an ordinal measure and 
reliability was measured with Cohen’s kappa. Interrater reliability ranged from .71 to 
.96. Because productivity (WW and ideas) and CBM (CWS and %CWS) scores are 
continuous measures, reliability was measured by intraclass correlation coefficient 
(ICC); reliability ranged from .92 to 1.0. Following the reliability calculations, each 
GRA pair compared scores and came to an agreement about any discrepancies before 
recording the final score. 
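
For readers unfamiliar with the agreement statistic used for the ordinal quality scores, the sketch below shows how an unweighted Cohen's kappa can be computed from two raters' scores. It is an illustration under stated assumptions rather than the authors' procedure: the source does not say whether a weighted or unweighted kappa was used, the 0 to 3 score range is assumed from the statement that each category had four possible scores, and the ratings shown are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of essays.")
    n = len(rater_a)
    # Proportion of essays on which the two raters gave the same score.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is perfect
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented ratings for eight essays on an assumed 0-3 scale.
gra_1 = [0, 1, 2, 3, 3, 2, 1, 0]
gra_2 = [0, 1, 2, 3, 2, 2, 1, 1]
kappa = cohens_kappa(gra_1, gra_2)
print(round(kappa, 3))
```

Kappa corrects raw percent agreement for the agreement that would be expected by chance, which is why it is usually lower than the simple proportion of matching scores.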


Analytic strategy 


Preliminary statistics for this analysis (such as normality tests and correlations) were 
conducted in RStudio (RStudio Team, 2016). Modeling analyses were performed 
with Mplus, version 8.1 (Muthén & Muthén, 2017). Given that students were nested 
within classrooms, all our analyses accounted for the nested nature of the data using 
cluster-corrected standard errors. 

Most of the analyses presented in this paper contain some ordinal indicators, specifically
the scores from individual categories of the qualitative scoring system. Traditional
maximum likelihood (ML) estimation does not perform well in confirmatory
factor analysis (CFA) with ordinal data, so weighted least squares mean- and
variance-adjusted (WLSMV) estimation was used for most of the analyses, as recommended
by Finney and DiStefano (2013) and Bandalos (2014). When used with ordinal data 
with four categories, WLSMV is more likely to result in unbiased parameter estimates
compared to ML or robust ML estimation (Bandalos, 2014). The only exception was
the third research question, which contained only continuous indicators and there- 
fore did not require WLSMV. Instead, the analyses for Question 3 used robust maxi- 
mum likelihood estimation (MLR). MLR is also robust to nonnormal data (Brown, 
2015), and this was important for this study because skewness and kurtosis tests 
(performed in the moments package of R; Komsta & Novomestky, 2015) were significant
for several variables. 
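
The nonnormality screening described above rests on the third and fourth sample moments.
The sketch below, written in Python rather than R, shows the moment-based skewness and
excess kurtosis that such tests build on. It is an illustration only: the function name and
the example values are hypothetical, and it does not reproduce the significance tests from
the moments package.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness (m3 / m2**1.5) and excess kurtosis (m4 / m2**2 - 3)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    # Central moments computed with a divisor of n.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("moments are undefined for constant data")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical count scores bounded below by zero, as CWS scores are.
hypothetical_counts = [0, 0, 0, 1, 1, 2, 3, 5, 9, 14]
skew, excess_kurtosis = skewness_and_excess_kurtosis(hypothetical_counts)
```

Scores that cannot fall below zero and cluster near it, as in the hypothetical list, yield
a positive skewness value.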

For each question, model fit was examined with model-fit statistics and chi-square
tests of difference (for questions with categorical data, chi-square tests 
were performed with the DIFFTEST option in Mplus; Muthén & Muthén, 2017). 
The chi-square test was important for determining the relative fit of models, that 
is, which model fit better than another. However, because chi-square values are
sensitive to sample size (Kline, 2016; Marsh, Hau, Balla, & Grayson, 1998), CFI,
TLI, and RMSEA were also used to evaluate model fit. Because the measurement of 
complex skills such as writing is somewhat unreliable by nature for kindergartners 
(e.g., McMaster & Espin, 2007), less conservative rules of thumb were employed 
for determining reasonable model fit. Specifically, values of about .90 or higher for 
CFI and TLI and about .10 or lower for RMSEA were deemed reasonable (Browne 
& Cudeck, 1993; Hu & Bentler, 1999). All the initial models had poor fit, so after 
determining whether the more parsimonious or less parsimonious model fit better 
based on theory, modification indices were also examined to highlight areas with 
poor local fit and suggest improvements (Brown, 2015). When the modification 
indices suggested freely estimating parameters that were theoretically sensible, these 
parameters were added one at a time to the better-fitting model. 
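
The rules of thumb above can be sketched as a small check in Python. The function names,
the strict comparisons, and the textbook RMSEA point estimate sqrt(max(chi-square - df, 0) /
(df x N)) are assumptions made for illustration; Mplus applies estimator-specific
corrections under WLSMV and MLR, so the values it reports can differ slightly. The inputs
are the values reported in Table 4 for the final two-factor model of Research Question 2,
with N = 281 as in Table 2.

```python
import math


def rmsea(chi_square, df, n):
    """Textbook RMSEA point estimate; an approximation to what Mplus reports."""
    if df <= 0 or n <= 0:
        raise ValueError("df and n must be positive")
    return math.sqrt(max(chi_square - df, 0.0) / (df * n))


def reasonable_fit(cfi, tli, rmsea_value, index_min=0.90, rmsea_max=0.10):
    """Apply the cutoffs (CFI/TLI >= .90, RMSEA <= .10) as strict comparisons."""
    return {
        "CFI": cfi >= index_min,
        "TLI": tli >= index_min,
        "RMSEA": rmsea_value <= rmsea_max,
    }


# Values reported in Table 4 for the final two-factor model (Research Question 2).
estimate = rmsea(chi_square=170.554, df=50, n=281)
checks = reasonable_fit(cfi=0.905, tli=0.874, rmsea_value=0.092)
```

The estimate is approximately .093, close to the reported .092. Under strict comparison the
TLI of .874 falls short of .90, which shows why the cutoffs were phrased as "about" .90 and
"about" .10 rather than as hard thresholds.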

There were two main types of parameter additions that made sense theoretically. 
The first was correlation between the errors of two indicators from the same essay 
(for example, relationship to prompt and response type for Essay 1). It is reasonable 
to think that because these indicators were based on a single essay, their error vari- 
ances would be related. The second theoretically sensible correlation was between 
the errors of the same measure for different essays (for example, mechanics for 
Essay 1 with mechanics for Essay 2). This is another reasonable suggestion, because 
children who use good or poor punctuation or capitalization in one essay are likely 
to do so on the second. 


Results 


Descriptive statistics 


Means and standard deviations for each measure are presented in Table 2. Descrip- 
tive data indicated that on average, children wrote nine words for the special event 
essay and about eight words for the expert essay. The CBM scores indicated a great 
deal of variability in both CWS and %CWS. Because neither of these variables could
be lower than 0, both had strong positive skews; however, the estimation methods
used in the CFAs of this study are capable of handling nonnormal
data. Examination of the pattern of scores on the qualitative scoring systems did not 
reveal any distinct patterns whereby one category was easier in one essay than the 
other. 

Table 2 Descriptive statistics for composition measures 

                          Essay 1 (special event)    Essay 2 (expert)
Measure                   Score    Frequency         Score    Frequency
Response type             0        25                0        35
                          1        45                1        38
                          2        135               2        145
                          3        76                3        63
Relationship to prompt    0        129               0        133
                          1        63                1        78
                          2        42                2        38
                          3        47                3        32
Grammatical structure     0        61                0        63
                          1        19                1        31
                          2        106               2        104
                          3        95                3        83
Spelling                  0        31                0        39
                          1        76                1        89
                          2        158               2        137
                          3        16                3        16
Mechanics                 0        50                0        72
                          1        140               1        129
                          2        70                2        67
                          3        21                3        13

Measure                   Mean     SD                Mean     SD
Words written             9.00     6.13              8.11     5.83
Ideas                     1.36     1.23              1.32     1.17
CWS                       3.90     4.36              3.04     4.13
%CWS                      33.65    25.18             27.48    24.42

N = 281 for all scores for both essays 


Correlations between the observed variables are presented in Table 3. Correlations
between two continuous variables were Pearson correlations; correlations
between a continuous and a categorical variable or between two categorical 
variables were Spearman correlations, as this type of correlation is more appropriate 
for categorical data (Rugg, 2007). With a few exceptions that are discussed in more 
detail below, correlations were small to moderate. 
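
The distinction between the two correlation types can be sketched in Python as follows.
Spearman's coefficient is computed here as a Pearson correlation of average ranks, which is
the standard definition when ties are present, as they are with 0-3 category scores. The
function names and the six pairs of scores are hypothetical and are not the study's data.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: a continuous count and an ordinal 0-3 category score.
words_written = [3, 5, 8, 9, 12, 20]
mechanics = [0, 1, 1, 1, 2, 3]
rho = spearman(words_written, mechanics)
```

Because ranks discard the spacing between values, a monotonic but nonlinear relation gives
a Spearman coefficient of 1 while the Pearson coefficient stays below 1.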


Dimensions in the qualitative scoring system 


Table 3 Correlations between observed variables 

                        (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)   (10)
1. Response Type 1      -
2. Response Type 2      .45   -
3. Rel. to Prompt 1     .51   .33   -
4. Rel. to Prompt 2     .33   [?]   .37   -
5. Gram. Structure 1    .57   .38   .34   .31   -
6. Gram. Structure 2    .27   .65   .25   .41   .34   -
7. Spelling 1           .42   .33   .27   .28   .43   .29   -
8. Spelling 2           .31   .45   .16   [?]   .31   .41   .39   -
9. Mechanics 1          .51   .40   .28   .23   .42   .20   .42   .28   -
10. Mechanics 2         .34   .56   .19   .34   .26   .39   .32   .42   .50   -
11. Words Written 1     .15   .41   [?]   .32   .32   .18   .38   .28   .38   .30
12. Words Written 2     .45   .69   .35   .46   .24   .30   .30   .40   .30   .43
13. Ideas 1             .84   .43   .46   .31   .48   .24   .37   .27   [?]   [?]
14. Ideas 2             .48   .82   .37   [?]   .28   .55   .30   .38   .36   .45
15. CWS 1               .63   .43   .36   .34   .44   .21   .70   .43   .58   .43
16. CWS 2               .42   .60   .24   .35   .36   .47   .46   .71   .39   .61
17. %CWS 1              .39   .33   .16   .28   .44   [?]   [?]   .41   .54   .38
18. %CWS 2              .34   .47   .15   .25   .34   .51   .43   .74   .33   .60

                        (11)  (12)  (13)  (14)  (15)  (16)  (17)  (18)
11. Words Written 1     -
12. Words Written 2     .58   -
13. Ideas 1             .83   .54   -
14. Ideas 2             .50   .80   .54   -
15. CWS 1               .74   .52   .66   .50   -
16. CWS 2               .43   .65   .42   .62   .61   -
17. %CWS 1              .29   .28   .29   .28   .73   [?]   -
18. %CWS 2              .27   .34   .25   .37   .49   .79   .56   -

Correlations among the continuous variables (11-18) are Pearson correlations; all others
are Spearman correlations. All correlations are significant, p < .01 

Rel. to Prompt relationship to prompt, Gram. Structure grammatical structure, CWS correct
word sequences, CIWS correct minus incorrect word sequences, %CWS percent correct word
sequences 


Two alternative CFA models were fit to test the dimensionality of the qualitative 
scoring system: a one-factor model, in which the scoring system was unidimensional 
with all five indicators for both essays (10 total indicators) loading onto a single 
factor, and a two-factor model with two dimensions: macro-organization (relationship
to prompt, response type, and grammatical structure) and accuracy of writing 
conventions (spelling and mechanics). The fit for both models was poor; therefore,
we decided to modify the model based on modification indices and theoretical
considerations. There were three important theoretical considerations. First, previous 
research has shown that young children’s text generation is significantly constrained 
by their transcription skills (Graham et al., 1997; Puranik & Al Otaiba, 2012). Second,
Coker and Ritchey (2010) reported that the sentence writing quality score taps 
a unitary dimension in kindergarten students’ written performance. Finally, we
calculated Cronbach’s alpha for the quality score of the two essays to investigate the 
degree to which items assessed a single construct. Internal consistency reliability for 
Essay 1 was .99 and for Essay 2 was .82. 


Fig. 1 Standardized factor loadings of the final model for the qualitative scoring system.
Gram. Structure grammatical structure, Rel. to Prompt relationship to prompt 
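
The internal consistency statistic used for the quality score, Cronbach's alpha, can be
sketched in Python as follows. The function name and the scores are hypothetical and are
not the study's data; the sketch assumes one list of 0-3 scores per rubric category, with
children in the same order in every list.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of items, each a list with one score per child."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(item_scores[0])
    if n < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("every item needs the same number (>= 2) of scores")

    def variance(values):
        # Sample variance with a divisor of n - 1.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Total score per child: the sum of that child's item scores.
    totals = [sum(child_scores) for child_scores in zip(*item_scores)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 0-3 scores for five rubric categories and six children.
hypothetical_items = [
    [0, 1, 2, 2, 3, 3],  # response type
    [0, 0, 1, 2, 2, 3],  # relationship to prompt
    [0, 1, 2, 2, 3, 3],  # grammatical structure
    [1, 1, 2, 2, 2, 3],  # spelling
    [0, 1, 1, 2, 2, 3],  # mechanics
]
alpha = cronbach_alpha(hypothetical_items)
```

Alpha approaches 1 when children who score high in one category also score high in the
others, which is the pattern a unidimensional scoring system would be expected to show.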


Adding the modifications suggested in the modification indices rapidly increased 
the correlation between the two factors, and many of the modification indices sug- 
gested cross-loading indicators on both factors. For these reasons, it seemed prefer- 
able to retain the one-factor model and use the modification indices and theoretical 
considerations to improve it. The final one-factor model for the qualitative scor- 
ing system is depicted in Fig. 1, and model fit statistics are presented in Table 4: 
χ²(31) = 137.071, CFI = .951, TLI = .929, RMSEA = .11 (.092, .129). 


Dimensionality of quality and productivity 


The second research question examined whether the qualitative scoring system 
and the productivity scoring system represent a single dimension or distinct 
dimensions of kindergarten composition. Preliminary data screening revealed 
a problematically high correlation between the response type indicator and the 
ideas indicator for each essay, above r = .80 in both cases. In theory, these two 
measures are closely related but not identical. Response type captures the
completeness of the response, with one point awarded for having one to several 
words and up to three points awarded for multiple sentences or a complex sentence.
Ideas is a measure of how many complete propositions exist in the writing. 
Therefore, response type is a more lenient indicator in that it awards points for a 
lower standard (such as a few words that don’t make a complete sentence); however,
it has a maximum of three points. Thus, a composition with several complete
sentences would receive the same score as a composition with two complete 
sentences. Conversely, the ideas measure does not award points for incomplete 
sentences, but it can award a theoretically unlimited number of points for
compositions with more complete propositions. With our sample, these two measures 
were practically identical. There were many compositions that contained a few 
words but no complete sentences. However, there were few that exceeded two 
complete sentences or a single complex sentence. Thus, there was not sufficient 
variation at the higher end of the spectrum to make ideas a distinct indicator. 


Table 4 Overall model fit indices for final models 

Research  Figure  Model description            χ² (df, p)             CFI (TLI)     RMSEA (90%
question  number                                                                    confidence interval)
1         -       Initial one-factor model     226.757 (35, <.001)    .911 (.886)   .139 (.122-.157)
1         -       Initial two-factor model     205.408 (34, <.001)    .920 (.895)   .134 (.116-.152)
1         1       Final one-factor model       137.071 (31, <.001)    .951 (.929)   .110 (.092-.129)
2         -       Initial one-factor model     308.862 (54, <.001)    .798 (.753)   .129 (.116-.144)
2         -       Initial two-factor model     236.661 (53, <.001)    .855 (.819)   .111 (.097-.125)
2         2       Final two-factor model       170.554 (50, <.001)    .905 (.874)   .092 (.077-.108)
3         -       One-factor model (final)     40.090 (2, <.001)      .829 (.487)   .260 (.193-.333)
3         -       Two-factor model             37.740 (1, <.001)      .835 (.010)   .361 (.268-.464)
4         -       Initial two-factor model     596.936 (103, <.001)   .732 (.687)   .130 (.120-.141)
4         -       Initial three-factor model   596.795 (101, <.001)   .731 (.680)   .132 (.122-.142)
4         3       Final two-factor model       331.231 (90, <.001)    .869 (.825)   .097 (.086-.109)

CFI comparative fit index, TLI Tucker-Lewis index, RMSEA root mean square error of
approximation 

The close relationship between these two variables resulted in difficulty 
in model convergence. When variables are too highly correlated, it is best to 
combine or drop one of them (Kline, 2016). Response type was dropped from the 
models because the ideas indicator is more widely represented in writing research 
(e.g., Kim et al., 2014; Puranik et al., 2008; Wagner et al., 2011) than the response 
type indicator. Furthermore, it seemed better to retain the indicator that had a 
larger possible range of values. Additionally, dropping the ideas indicators would 
have resulted in only two indicators (the WW indicators) loading onto the productivity
factor. Two-indicator factors can be problematic both for identification and
because they allow more measurement error (Kline, 2016). 


Fig. 2 Standardized factor loadings of the final model for the qualitative scoring system
with productivity indicators. Gram. Structure grammatical structure 


The model in which productivity was a unique factor had significantly better fit 
than the one-factor model, χ²(1) = 91.28, p < .001, but the fit of both models was 
poor. The addition of several theoretically sensible correlated error terms that were 
suggested by the modification indices improved the model fit to acceptability. The 
final model with standardized factor loadings is depicted in Fig. 2, and model fit 
statistics are presented in Table 4: χ²(50) = 170.554, CFI = .905, TLI = .874, 
RMSEA = .092 (.077, .108). 
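
The p value attached to the difference test above can be checked with a short Python
sketch. For one degree of freedom, the upper-tail probability of a chi-square statistic x
equals erfc(sqrt(x / 2)), so no statistics library is needed. The function name is
hypothetical, and the sketch covers only the final step of referring the statistic to a
chi-square distribution; the DIFFTEST procedure in Mplus computes the statistic itself and
is not reproduced here.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail p value of a chi-square statistic with one degree of freedom."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    # A chi-square variable with 1 df is a squared standard normal variable, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistic reported for the comparison of the one- and two-factor models.
p_value = chi_square_p_value_df1(91.28)
```

The resulting value is far below .001, in line with the reported p < .001; as a sanity
check, the familiar critical value of 3.841 returns a p value of about .05.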


Dimensionality of CBM indicators 


The purpose of Research Question 3 was to examine the dimensionality of the 
CBM indicators when they are used independently of other measures. This analy- 
sis compared the fit of a unidimensional and a two-dimensional model. The fit for 
both models was unacceptably poor. Modifications to improve the model fit were not 
attempted because the model had only two degrees of freedom. Model fit statistics 
are presented in Table 4. 
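
The degrees of freedom mentioned above follow from counting the information in the data
against the parameters estimated. The Python sketch below uses a counting rule that is an
assumption made for illustration and is not stated in the text: every indicator loads on
exactly one factor, all factors are allowed to correlate, and indicator variances (or
thresholds, for ordinal indicators) are saturated, so that only the correlations among
indicators remain to be explained. The function name is hypothetical.

```python
def cfa_degrees_of_freedom(n_indicators, n_factors, n_extra_parameters=0):
    """Degrees of freedom for a simple-structure CFA with correlated factors."""
    if n_indicators < 1 or n_factors < 1 or n_extra_parameters < 0:
        raise ValueError("counts must be positive (extra parameters non-negative)")
    correlations = n_indicators * (n_indicators - 1) // 2    # information to explain
    loadings = n_indicators                                   # one loading per indicator
    factor_correlations = n_factors * (n_factors - 1) // 2
    return correlations - loadings - factor_correlations - n_extra_parameters


# Four CBM indicators (CWS and %CWS for each of the two essays).
df_one_factor = cfa_degrees_of_freedom(4, 1)
df_two_factor = cfa_degrees_of_freedom(4, 2)
```

Under this rule the one-factor CBM model has 2 degrees of freedom and the two-factor model
has 1, matching Table 4. The same rule reproduces the initial models for the other research
questions (for example, 35 and 34 for the ten qualitative indicators). Each added correlated
error term costs one further degree of freedom, which leaves a four-indicator model almost
no room for modification.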


Dimensionality of all composition measures 


The purpose of Research Question 4 was to build on the findings of the previous 
models and additionally determine the best model for accommodating the CBM 
indicators. Two models were fit. In the first model, CWS (a production-dependent 
CBM measure) loaded onto the productivity factor, along with WW and ideas, 
whereas %CWS loaded onto the writing quality factor, along with qualitative scor- 
ing system indicators. In the second model, both the CWS indicators and the %CWS 
indicators loaded onto a CBM factor that was distinct from the productivity and 
quality measures. For the analyses, response type indicators were not included in 
either model because of previously-discussed problems with collinearity between 
response type and ideas. The fit of both models was unacceptably poor. Modifica- 
tions suggested in the modification indices were added to both models until each 
approached the minimum reasonable fit statistics. However, this required the addi- 
tion of many correlated error terms for both models. The mediocre fit and high num- 
ber of correlated errors in this model suggest that additional research may be needed 
to answer this question satisfactorily. 

The results of these analyses may tentatively suggest support for the two-factor 
model for several reasons. When evaluating the parsimony of the models for the fourth 
research question, the two-factor model is preferable to the three-factor model. It has 
fewer dimensions, and it required two fewer correlated error terms to achieve mediocre
fit. In addition, the two-factor model is slightly more interpretable than the three-factor 
model, both because it has fewer correlated error terms and because it gives inherent 
meaning to the CBM scores. Finally, the three-factor model 
required several correlated error terms between CBM indicators and the indicators from 
other factors. The fact that CBM indicators may share additional variance with indicators
from other factors suggests that they may fit better when modeled as loading onto 
those other factors. Model fit statistics are presented in Table 4, and the final two-factor 
model is presented in Fig. 3: χ²(90) = 331.231, CFI = .869, TLI = .825, RMSEA = .097 
(.086, .109). The results of this model should be interpreted with extreme caution, 
because the fact that so many parameters needed to be added to the model to achieve 
even a mediocre fit suggests that the theoretical model may have been a poor starting 
point for modeling the data. 


Fig. 3 Standardized factor loadings of the final two-factor model that includes indicators
from all three scoring systems. Rel. to Prompt relationship to prompt, Gram. Structure
grammatical structure, %CWS percent correct word sequences, CWS (number of) correct word
sequences 


Discussion 


In addition to expectations regarding writing letters, spelling words, and writing 
sentences, Common Core State Standards (Common Core Standards, 2010) also 
contain expectations for composing text. Kindergarten children are expected to 
use a range of compositional methods including drawing, dictating, and writing 
to narrate a single event or several linked events with some details about events 
and a sense of closure. They are expected to compose informative text and opin- 
ion pieces in which they introduce the topic they are writing about and state an 
opinion about the topic (CCSS, 2010). Yet, we do not have reliable and valid 
measures for scoring compositions in young, beginning writers in kindergarten. 
This study contributes to the literature on the assessment of composition ability of 
young, beginning writers, because to date there is little research on the assess- 
ment of composition in kindergarten children. This study helps to clarify the 
dimensionality of a promising qualitative scoring system for compositions that 
could be particularly useful to teachers because it is quick to administer. Addi- 
tionally, this study replicates the finding that writing quality and writing produc- 
tivity are closely related but are nevertheless distinct measures. 

This study attempted to replicate Kim and colleagues’ (2014) finding with 
first graders that a qualitative scoring system comprises two distinct dimensions. 
Despite the fact that the new scoring system used in this study measures similar 
constructs to the 6+1 Traits Rubric that was used in Kim’s study, the new scor- 
ing system was best modeled as unidimensional. Our findings suggest that the 
five aspects of the adapted qualitative scoring system cohere to capture a sin- 
gle dimension of substantive quality in the sample of kindergarten children. The 
dimension captures young children’s ability to generate ideas, respond appropri- 
ately to the prompt, and use appropriate grammatical structures and transcription 
skills such as spelling and mechanics. Coker and Ritchey (2010) also reported 
similar results regarding unidimensionality when using a quality rubric to measure 
sentence writing and concluded that the qualitative score ‘assesses multiple profi- 
ciencies of beginning writing using a single indicator’ (p. 189). 

Given the similarities between the present qualitative scoring system and the 
scoring system used in Kim et al.’s (2014) study, it seems possible that there are 
substantial differences between the composition abilities of kindergartners and 
those of the first graders in Kim’s study. The high correlation between the accu- 
racy and macro-organization factors in the present study indicates that measures 
of these two factors covary to such a high degree that they cannot be separated; 
children with high accuracy almost always have high macro-organization, and 
children with poor accuracy almost always have poor macro-organization. This 
could be related to the fact that most of the children in this study wrote short 
compositions, rarely longer than a couple of sentences. Children who wrote short 
compositions had few chances to demonstrate technical accuracy. For example, a 
child who did not write a complete sentence would have received a low score on 
all three of the macro-organization indicators; although the child may have been 
able to achieve a high spelling score, he or she would have been unable to score 
above one point for mechanics, because a score of two or higher requires the correct
use of punctuation, which is almost always a sentence-ending period. Few 
children used commas or other punctuation in their composition. 

An alternative but less likely explanation is that this closer link between technical 
accuracy and macro-organization may be an artifact of the particular scoring sys- 
tem that was used in the present study. Extant research has clearly indicated that the 
quality of young children’s writing is significantly constrained by their transcription 
skills such as spelling and handwriting (Graham et al., 1997; Puranik & Al Otaiba, 
2012). In other words, macro-organization is largely constrained by technical accu- 
racy. Consequently, a more plausible explanation of our results of unidimensionality 
is that these results may indicate a stronger constraining influence of transcription 
skills for the kindergarten children in this sample compared to the first graders in 
Kim’s sample (see Berninger, Mizokawa, & Bragg, 1991, for an explanation of the 
developmental constraints hypothesis in composition). For kindergartners, quality 
is so constrained by transcription that the two are indistinguishable. 

In line with previous research, this study demonstrated that productivity and
quality of writing are two distinct but correlated dimensions 
even for young, beginning writers. As Kim, Al Otaiba, Wanzek, et al. (2015) argue, 
there is a conceptual link between productivity and quality in writing. There is a 
certain amount of text (productivity) that is required in order to fully convey an idea 
(quality), and the more text that is included in a composition, the more opportunity 
there is to expand on ideas and organize them well. However, some students may be 
relatively verbose writers without necessarily adding to the quality of their piece. In 
this sample, this was sometimes the case with students who wrote a great deal about 
a topic unrelated to the prompt, or who wrote about multiple topics that were unre- 
lated to each other. Thus, productivity and quality are both conceptually related and 
conceptually distinct. 

Interestingly, when indicators from the productivity scoring system were included 
in the model with indicators from the qualitative scoring system, one of the catego- 
ries from the qualitative scoring system (i.e. the response type category) was cor- 
related with one of the productivity indicators (i.e. the ideas indicator) to such a 
degree that it had to be dropped from the model to prevent model estimation prob- 
lems. This indicates that the particular qualitative scoring system used in this study 
may also have measured some aspects of writing productivity. These characteris- 
tics may make the qualitative scoring system particularly useful for educators who 
want to quickly and easily get a big-picture view of a child’s composition ability. 
The scoring system may also be useful for progress monitoring or placing children 
in ability groupings, as is the suggested use of CBM. However, unlike CBM, this 
scoring system can capture aspects of the content of children’s writing, such as how 
closely related the composition is to a prompt. Furthermore, the scoring system cat- 
egories are more inherently meaningful than CBM indicators. If a teacher sees that 
a child’s compositions consistently receive a low score in a particular category, the 
teacher can plan instruction about (for example) including additional details in writ- 
ing. Conversely, when a child’s composition consistently receives low CBM scores, 
it is impossible to tell from the CBM score exactly which aspects of writing should 
be targeted. 

When CBM indicators were added to the previous models, the models could only 
achieve mediocre fit with the addition of many correlated error terms. Given the 
poor fit of the data and the inability to use a direct statistical comparison between 
these models, it is difficult to draw conclusions about which model has better fit. 
However, the results of this study tentatively suggest support for the two-factor 
model over the three-factor model for several reasons. The first reason is that the 
three-factor model required a high number of cross-dimension parameters in order to 
achieve mediocre fit, including many correlated errors between the CBM indicators 
and the indicators for productivity and quality dimensions. This may suggest that the 
CBM indicators share too much variance with the indicators from other factors to be 
modeled separately. The second reason that the two-factor model may be preferable 
is that it gives more meaning to the CBM indicators. If a scoring system measures 
something about writing that is distinct from the components that researchers and 
educators consider important (such as quality), it is less useful than a scoring sys- 
tem that measures a meaningful component. Considering CBM scores as indicators 
of meaningful components of writing, such as quality and productivity, rather than 
considering them as indicators of nothing more than an overall CBM score, assigns 
the indicators meaning and makes the model more easily interpretable. Of course, 
due to the relatively poor fit of the models from this paper, future research is neces- 
sary to determine how well these CBM indicators actually measure the meaningful 
components of writing (if at all). Choosing a more interpretable model is not useful 
if the model does not actually represent the data well. 

Previous researchers have questioned the reliability of CBM for young writers 
(e.g. McMaster & Espin, 2007), despite its prevalence. This questionable reliability 
may have been one source of the trouble with model fits in this study, particularly 
since CFA depends on having reliable measures (Kline, 2016). The present study 
attempted to control for error of measurement by including two essays and sev- 
eral measures of each construct, but these attempts were apparently not sufficient 
for improving an already error-prone measure. The reliability of CBM for writing 
is highly dependent on the number of words students generate (Jewell & Malecki, 
2005). This may have been the other source of the trouble since the kindergarten 
children in our sample wrote few words. Coker and Ritchey (2010) also reported this 
issue in their study with kindergarten children even when children were only producing
sentences. Criterion validity for CBM measures for sentence writing ranged 
from .20 to .30 and was lower than the criterion validity for the quality and
productivity scores. Thus, whereas CBM measures may be appropriate for measuring 
letter writing, sound spelling, spelling, and sentence writing in kindergarten children,
they appear to be less appropriate for measuring the composition abilities of
kindergarten writers. 

Indeed, all of the models in this study had relatively low values for model 
fit indices, and in most cases, even the final models achieved only mediocre fit 
(Browne & Cudeck, 1993; Hu & Bentler, 1999). The final fit was lower than the 
minimum values that have been recommended by other experts (e.g. Nye & Dras- 
gow, 2011; Yu, 2002). If the more conservative cut-offs for fit indices had been 
pursued, the models would have included many parameters that were not specified
a priori, and this could risk capitalizing on chance associations present in this 
particular sample but not necessarily representative of the population (Brown, 
2015). Conversely, an approach that was more conservative with adding param- 
eters to the models would have resulted in rejecting each model outright, provid- 
ing little additional information for future researchers. Instead, this paper sought 
to strike a balance between finding a model that was empirically supported by 
the present data and finding a model that was similar to the models supported by 
previous studies. 

These challenges reflect the difficulty of assessing writing in general, and they 
underscore the difficulty of assessing writing in young, emerging writers who have 
limited to modest writing abilities. They also highlight the complexity of assessing 
writing; useful writing assessments in one grade may not be useful in another grade. 
This has been demonstrated in this research, in which CBM scores that have been 
modeled acceptably in other primary grades (Kim, Al Otaiba, Wanzek, et al., 2015; 
Kim et al., 2018) could not be acceptably fit to the same model for kindergartners. 
Indeed, the fact that a single method for scoring writing cannot be used in all grades 
has been shown by other researchers. For example, Jewell and Malecki (2005) found 
that certain CBM scores were strong predictors of qualitative measures of writing 
for second-grade students, but not for fourth- and sixth-grade students. Similarly, 
Parker, Tindal, and Hasbrouck (1991) found that certain CBM scores were suitable 
as screening measures for struggling writers in fourth grade, but not in the second 
and third grades. Taken together, the results of these studies indicate that what we 
know about writers in one grade may not apply to writers in another grade. Accordingly,
what we know about first graders, who are also young, beginning writers, may
not apply to kindergarten students. 

It is clear that we need continued research to determine the best approach to 
measure composition skills in young children. These assessments might need to be 
grade specific to capture developmental competencies. The popularity of holistic/ 
analytic rubrics stems mainly from their convenience and general reliability of scoring;
however, they are not perfect. Although holistic scoring rubrics such as the quality
scoring rubric used in this study may be efficient and convenient for scoring writing,
they also appear to be designed for the majority (i.e., the average student) and 
may be less sensitive to writing features displayed by an above-average student.
Consider a child who attempts to express a complex thought. To do so, he or
she may attempt to spell a difficult word, which in turn could lead to more spelling 
errors and a lower score on the spelling dimension than that of a child who used 
simpler words but spelled them correctly. Perhaps adding a category to measure
word choice, as is used in the 6+1 rubrics, would be useful. Therefore, one very 
important avenue for future research is the creation of measures that are develop- 
mentally sensitive to features displayed by good, average, and poor writers. 

Another important avenue for future research is designing a quality rubric that 
is more sensitive to textual features, a point raised previously by other researchers 
(e.g., Huot, 2002). Holistic scoring systems have an identical range of scores allo- 
cated across the various categories measured. In the quality rubric used in this study, 
scores across all five categories examined were rated on a score of 0-3. However, 
some categories might be better scored on a larger range of scores based on developmental
writing expectations. 

Until such a time, based on the results of the present study, including the internal
consistency reliability, and the findings of other research on writing in kindergarten 
children (e.g., Coker & Ritchey, 2010; Ritchey, 2008), a qualitative scoring system and
a productivity scoring system appear to be reasonably sufficient for measuring
kindergarten composition writing. These measures are easy and quick to administer and
score, which is an important consideration for school-based research and in-classroom
assessment. They allow teachers to make judgements about overall writing quality based
on a number of dimensions that previous researchers have found to be 
important to evaluate. Consequently, a teacher could direct instructional attention to 
the important aspects of writing (productivity and quality, including accuracy and 
macro-organization) that a student might be struggling with. A look at our data indicates
that nearly half of the children obtained a score of zero on the ‘relationship 
to prompt’ criterion. Per CCSS standards, kindergarten children are expected to introduce
the topic they are writing about and state an opinion about the topic (CCSS, 
2010). Clearly, our data indicate that many students (approximately 46%) are unable 
to do so. This may be a dimension of writing that kindergarten teachers need to 
specifically focus on during writing instruction. 

Finally, because young children’s ability to produce written text is severely con- 
strained by their transcription skills, perhaps eliciting ideas orally may reveal organi- 
zational capacities that are obscured by tasks that require the production of text. If 
students exhibit difficulties with generating ideas and organizing thoughts, instruc- 
tion could focus on these two elements without the additional burden of writing. 
Once students are able to generate ideas and organize text, teachers could further
support the writing process by helping students spell words or form letters.


Limitations and Directions for Future Research 


This study has raised several interesting questions about kindergartners’ composition
ability that cannot be fully explored with the present data. For example, collecting
three or more compositions from children would have allowed for method effects
to be included in the model. Including these method effects may have allowed 
clearer conclusions to be drawn about dimensionality because fewer correlated error 
terms would have been required. For instance, given the inherent theoretical link
between quality and productivity (as well as links between other measures), it may 
have been beneficial to assume that the quality indicators and productivity indicators 
of a particular essay would be related, over and above the relation between the qual- 
ity indicators from multiple essays by the same participant. Being able to model this 
relationship with a multi-trait, multi-method model may have significantly improved 
model fit. However, these types of models require either more than two measure- 
ments or stringent assumptions about the structure of the data (Brown, 2015; Wida- 
man, 1985) that may have been unmerited in this case. 

The discrepancies between the present paper and previous findings were unexpected
given the wealth of research supporting similar factor structures for the compositions
of slightly older writers (e.g., Hall-Mills & Apel, 2015; Kim et al., 2014;
Kim, Al Otaiba, Wanzek, et al., 2015; Wagner et al., 2011). There are several possible 
explanations for these differences. The first is that in young children, measurements
of complex skills like composition may be inherently error-prone. This problem can 
sometimes be circumvented by taking several measurements within a short time span 
so that additional measures of each indicator type can be included in the model. A second
possibility is that the composition skills of kindergartners are qualitatively different
from those of older children, so any model of kindergarten composition that is based on
models of older elementary school students’ composition will be a poor fit. Future researchers may benefit
from taking these considerations into account when planning studies. 

Children were given a prompt and a short span of time to write about the prompt. 
This means that any conclusions drawn about the dimensionality of composition ability 
may only apply to children’s ability to compose spontaneously over a short time frame. 
This is one of the most common methods of measuring composition ability for young 
children (e.g., Abbott & Berninger, 1993; Graham, Harris, & Fink, 2000; Kent et al.,
2014; Kim, Al Otaiba, Wanzek, et al., 2015; Wagner et al., 2011), perhaps because
it gives the purest picture of a child’s independent ability. Including scores from a
child’s compositions for school assignments may provide an interesting supplement to 
future research. 

Finally, the demographic make-up of the sample of children who participated in this 
study must be acknowledged. The sample was predominantly White. They were typi- 
cally developing children who came from predominantly English-speaking homes with 
higher than average family income. Future research should attempt to include a more 
diverse group of children (e.g., children with writing disabilities, English language 
learners) to improve the external validity of the results.

In conclusion, this study contributes additional knowledge in the field of writing 
assessment by revealing two dimensions of children’s written composition for kinder- 
garten students: quality and productivity. It reinforces the usefulness of both qualitative 
and quantitative scoring systems for compositions for certain educational and research 
purposes, and it raises questions about the usefulness of CBM in certain situations. In 
an era when children are expected to read and write at increasingly younger ages, a
correct understanding of the measurement of writing is imperative.
See Table 5.